Abstract — The success of many machine learning algorithms, such as k-Nearest Neighbor (k-NN) classifiers and k-means clustering, depends heavily on the metric used to calculate distances between data points. An effective way to define such a metric is to learn it from a set of labeled training samples. In this work, we propose a fast and scalable algorithm to learn a Mahalanobis distance metric. The Mahalanobis metric can be viewed as the Euclidean distance metric on input data that have been linearly transformed. By employing the principle of margin maximization to achieve better generalization performance, the algorithm formulates metric learning as a convex optimization problem in which a positive semidefinite (p.s.d.) matrix is the unknown variable. Based on an important theorem that a p.s.d. trace-one matrix can always be represented as a convex combination of multiple rank-one matrices, our algorithm accommodates any differentiable loss function and solves the resulting optimization problem using a specialized gradient descent procedure. During the course of optimization, the proposed algorithm maintains the positive semidefiniteness of the matrix variable, which is essential for a Mahalanobis metric. Compared with conventional methods such as standard interior-point algorithms [2] or the special solver used in Large Margin Nearest Neighbor (LMNN) [23], our algorithm is much more efficient and scales better. Experiments on benchmark data sets suggest that, compared with state-of-the-art metric learning algorithms, our algorithm can achieve comparable classification accuracy with reduced computational complexity.

Index Terms — Large-margin nearest neighbor, distance metric learning, Mahalanobis distance, semidefinite optimization.



I. Introduction 

In many machine learning problems, the distance metric used over the input data has a critical impact on the success of a learning algorithm. For instance, k-Nearest Neighbor (k-NN) classification [4] and clustering algorithms such as k-means rely on whether an appropriate distance metric is used to faithfully model the underlying relationships between the input data points. A more concrete example is visual object recognition. Many visual recognition tasks can be viewed as inferring a distance metric that is able to measure the (dis)similarity of the input visual data, ideally in a way consistent with human perception. Typical examples include object categorization [24] and content-based image retrieval [17], in which a similarity metric is needed to discriminate between different object classes, or between relevant and irrelevant images for a given query. As one of the most classic and simplest classifiers, k-NN has been applied to a wide range of vision tasks, and it is a classifier that depends directly on a predefined distance metric. An appropriate distance metric is usually needed to achieve good accuracy. Previous work (e.g., [25], [26]) has shown that, compared to using the standard Euclidean distance, applying a well-designed distance can often significantly boost the classification accuracy of a k-NN classifier. In this work, we propose a scalable and fast algorithm to learn a Mahalanobis distance metric. The Mahalanobis metric removes the main limitation of the Euclidean metric in that it corrects for correlation between the different features.

Recently, much research effort has been spent on learning a Mahalanobis distance metric from labeled data [5], [23], [25], [26]. Typically, a convex cost function is defined such that a global optimum can be achieved in polynomial time. It has been shown in statistical learning theory [22] that increasing the margin between different classes helps to reduce the generalization error. Inspired by the work of [23], we directly learn the Mahalanobis matrix from a set of distance comparisons, and optimize it via margin maximization. The intuition is that such a learned Mahalanobis distance metric may achieve sufficient separation at the boundaries between different classes. More importantly, we address the scalability problem of learning the Mahalanobis distance matrix in the presence of high-dimensional feature vectors, which is a critical issue in distance metric learning. As indicated by a theorem in [18], a positive semidefinite trace-one matrix can always be decomposed as a convex combination of a set of rank-one matrices. This theorem has inspired us to develop a fast optimization algorithm that works in the style of gradient descent. At each iteration, it only needs to find the principal eigenvector of a matrix of size D x D (D is the dimensionality of the input data) and to perform a simple matrix update. This process incurs much less computational overhead than the metric learning algorithms in the literature [2], [23]. Moreover, thanks to the above theorem, this process automatically preserves the p.s.d. property of the Mahalanobis matrix. To verify its effectiveness and efficiency, the proposed algorithm is tested on a few benchmark data sets and is compared with state-of-the-art distance metric learning algorithms. As experimentally demonstrated, k-NN with the Mahalanobis distance learned by our algorithm attains comparable (sometimes slightly better) classification accuracy. Meanwhile, in terms of computation time, the proposed algorithm scales much better with the dimensionality of the input feature vectors.

We briefly review some related work before we present our work. Given a classification task, some previous work on learning a distance metric aims to find a metric that makes the data in the same class close and separates those in different classes from each other as far as possible. Xing et al. [25] proposed an approach to learn a Mahalanobis distance for supervised clustering. It minimizes the sum of the distances among data in the same class while maximizing the sum of the distances among data in different classes. Their work shows that the learned metric could improve clustering performance significantly. However, to maintain the p.s.d. property, they have used projected gradient descent, and their approach has to perform a full eigen-decomposition of the Mahalanobis matrix at each iteration. Its computational cost rises rapidly when the number of features increases, and this makes it less efficient in coping with high-dimensional data. Goldberger et al. [7] developed an algorithm termed Neighborhood Component Analysis (NCA), which learns a Mahalanobis distance by minimizing the leave-one-out cross-validation error of the k-NN classifier on the training set. NCA needs to solve a non-convex optimization problem, which might have many local optima. Thus it is critically important to start the search from a reasonable initial point. Goldberger et al. have used the result of linear discriminant analysis as the initial point. In NCA, the variable to optimize is the projection matrix.

The work closest to ours is Large Margin Nearest Neighbor (LMNN) [23] in the sense that it also learns a Mahalanobis distance in the large margin framework. In their approach, the distances between each sample and its "target neighbors" are minimized while the distances among the data with different labels are maximized. A convex objective function is obtained and the resulting problem is a semidefinite program (SDP). Since conventional interior-point based SDP solvers can only solve problems of up to a few thousand variables, LMNN has adopted an alternating projection algorithm for solving the SDP problem. At each iteration, similar to [25], a full eigen-decomposition is also needed. Our approach is largely inspired by their work. Our work differs from LMNN [23] in the following respects: (1) LMNN learns the metric from pairwise distance information. In contrast, our algorithm uses examples of proximity comparisons among triples of objects (e.g., example i is closer to example j than to example k). In some applications like image retrieval, this type of information could be easier to obtain than the actual class label of each training image. Rosales and Fung [16] have used similar ideas on metric learning; (2) More importantly, we design an optimization method that has a clear advantage in computational efficiency (we only need to compute the leading eigenvector at each iteration). The optimization problems of [23] and [16] are both SDPs, which are computationally heavy. Linear programs (LPs) are used in [16] to approximate the SDP problem. It remains unclear how good this approximation is.

The problem of learning a kernel from a set of labeled data shares similarities with metric learning because the optimization involved has similar formulations. Lanckriet et al. [11] and Kulis et al. [10] considered learning p.s.d. kernels subject to some pre-defined constraints. An appropriate kernel can often offer algorithmic improvements. It is possible to apply the proposed gradient descent optimization technique to solve the kernel learning problems. We leave this topic for future study.

The rest of the paper is organized as follows. Section II presents the convex formulation of learning a Mahalanobis metric. In Section III, we show how to efficiently solve the optimization problem by a specialized gradient descent procedure, which is the main contribution of this work. The performance of our approach is experimentally demonstrated in Section IV. Finally, we conclude this work in Section V.

II. Large-Margin Mahalanobis Metric Learning 

In this section, we propose our distance metric learning approach 
as follows. The intuition is to find a particular distance metric for 
which the margin of separation between the classes is maximized. 
In particular, we are interested in learning a quadratic Mahalanobis 
metric. 

Let $\mathbf{a}_i \in \mathbb{R}^D$ ($i = 1, 2, \cdots, n$) denote a training sample, where n is the number of training samples and D is the number of features. To learn a Mahalanobis distance, we create a set $\mathcal{S}$ that contains a group of training triplets, $\mathcal{S} = \{(\mathbf{a}_i, \mathbf{a}_j, \mathbf{a}_k)\}$, where $\mathbf{a}_i$ and $\mathbf{a}_j$ come from the same class and $\mathbf{a}_k$ belongs to a different class. A Mahalanobis distance is defined as follows. Let $\mathbf{P} \in \mathbb{R}^{D \times d}$ denote a linear transformation and dist be the squared Euclidean distance in the transformed space. The squared distance between the projections of $\mathbf{a}_i$ and $\mathbf{a}_j$ is:

$$\mathrm{dist}_{ij} = \|\mathbf{P}^\top \mathbf{a}_i - \mathbf{P}^\top \mathbf{a}_j\|_2^2 = (\mathbf{a}_i - \mathbf{a}_j)^\top \mathbf{P}\mathbf{P}^\top (\mathbf{a}_i - \mathbf{a}_j). \tag{1}$$

According to the class memberships of $\mathbf{a}_i$, $\mathbf{a}_j$ and $\mathbf{a}_k$, we wish to achieve $\mathrm{dist}_{ik} > \mathrm{dist}_{ij}$, which can be written as

$$(\mathbf{a}_i - \mathbf{a}_k)^\top \mathbf{P}\mathbf{P}^\top (\mathbf{a}_i - \mathbf{a}_k) > (\mathbf{a}_i - \mathbf{a}_j)^\top \mathbf{P}\mathbf{P}^\top (\mathbf{a}_i - \mathbf{a}_j). \tag{2}$$

It is not difficult to see that this inequality is generally not a convex constraint in $\mathbf{P}$, because a difference of quadratic terms in $\mathbf{P}$ is involved. In order to make this inequality constraint convex, a new variable $\mathbf{X} = \mathbf{P}\mathbf{P}^\top$ is introduced and used throughout the whole learning process. Learning a Mahalanobis distance is essentially learning the Mahalanobis matrix $\mathbf{X}$. With this substitution, (2) becomes linear in $\mathbf{X}$. This is a typical technique to convexify a problem in convex optimization [2].
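As a small numerical illustration of (1) and of the substitution $\mathbf{X} = \mathbf{P}\mathbf{P}^\top$, the following sketch checks that the squared distance in the projected space equals the quadratic form in $\mathbf{X}$. The sizes, the random seed and the function names are assumptions made only for this example; they do not come from the paper.

```python
import numpy as np

def sq_dist_projected(P, a_i, a_j):
    """Squared Euclidean distance between the projections P^T a_i and P^T a_j."""
    diff = P.T @ a_i - P.T @ a_j
    return float(diff @ diff)

def sq_dist_mahalanobis(X, a_i, a_j):
    """Quadratic form (a_i - a_j)^T X (a_i - a_j) with X = P P^T."""
    d = a_i - a_j
    return float(d @ X @ d)

rng = np.random.default_rng(0)
D, d = 5, 3                       # illustrative dimensions (assumed)
P = rng.standard_normal((D, d))   # hypothetical linear transformation
X = P @ P.T                       # p.s.d. by construction
a_i = rng.standard_normal(D)
a_j = rng.standard_normal(D)

dist_P = sq_dist_projected(P, a_i, a_j)
dist_X = sq_dist_mahalanobis(X, a_i, a_j)
```

Both quantities agree up to floating-point error, and the second one depends on $\mathbf{X}$ only linearly, which is what makes the constraint convex in $\mathbf{X}$.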

A. Maximization of a soft margin 

In our algorithm, a margin is defined as the difference between $\mathrm{dist}_{ik}$ and $\mathrm{dist}_{ij}$, that is,

$$\rho_r = (\mathbf{a}_i - \mathbf{a}_k)^\top \mathbf{X} (\mathbf{a}_i - \mathbf{a}_k) - (\mathbf{a}_i - \mathbf{a}_j)^\top \mathbf{X} (\mathbf{a}_i - \mathbf{a}_j), \quad \forall (\mathbf{a}_i, \mathbf{a}_j, \mathbf{a}_k) \in \mathcal{S}, \ r = 1, 2, \cdots, |\mathcal{S}|. \tag{3}$$

Similar to the large margin principle that has been widely used in machine learning algorithms such as support vector machines and boosting, here we maximize the margin (3) to obtain the optimal Mahalanobis matrix $\mathbf{X}$. Clearly, the larger the margin $\rho_r$, the better the metric that might be achieved. To enable some flexibility, i.e., to allow some of the inequalities (2) not to be satisfied, a soft-margin criterion is needed. Considering these factors, we could define the objective function for learning $\mathbf{X}$ as

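$$\begin{aligned}
\max_{\rho, \mathbf{X}, \boldsymbol{\xi}} \quad & \rho - C \sum_{r=1}^{|\mathcal{S}|} \xi_r \\
\text{subject to} \quad & \mathbf{X} \succeq 0, \ \mathrm{Tr}(\mathbf{X}) = 1, \\
& \xi_r \ge 0, \ r = 1, 2, \cdots, |\mathcal{S}|, \\
& (\mathbf{a}_i - \mathbf{a}_k)^\top \mathbf{X} (\mathbf{a}_i - \mathbf{a}_k) - (\mathbf{a}_i - \mathbf{a}_j)^\top \mathbf{X} (\mathbf{a}_i - \mathbf{a}_j) \ge \rho - \xi_r, \ \forall (\mathbf{a}_i, \mathbf{a}_j, \mathbf{a}_k) \in \mathcal{S},
\end{aligned} \tag{4}$$

where $\mathbf{X} \succeq 0$ constrains $\mathbf{X}$ to be a p.s.d. matrix and $\mathrm{Tr}(\mathbf{X})$ denotes the trace of $\mathbf{X}$. r indexes the training set $\mathcal{S}$ and $|\mathcal{S}|$ denotes the size of $\mathcal{S}$. C is an algorithmic parameter that balances the violation of (2) and the margin maximization. $\xi_r \ge 0$ is the slack variable, similar to that used in support vector machines, and it corresponds to the soft-margin hinge loss. Enforcing $\mathrm{Tr}(\mathbf{X}) = 1$ removes the scale ambiguity because the inequality constraints are scale invariant. To simplify exposition, we define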

$$\mathbf{A}^r = (\mathbf{a}_i - \mathbf{a}_k)(\mathbf{a}_i - \mathbf{a}_k)^\top - (\mathbf{a}_i - \mathbf{a}_j)(\mathbf{a}_i - \mathbf{a}_j)^\top. \tag{5}$$

Therefore, the last constraint in (4) can be written as

$$\langle \mathbf{A}^r, \mathbf{X} \rangle \ge \rho - \xi_r, \quad r = 1, \cdots, |\mathcal{S}|. \tag{6}$$

Note that this is a linear constraint on $\mathbf{X}$. Problem (4) is thus a typical SDP problem, since it has a linear objective function and linear constraints plus a p.s.d. conic constraint. One may solve it using off-the-shelf SDP solvers like CSDP [1]. However, directly solving problem (4) using those standard interior-point SDP solvers would quickly become computationally intractable as the dimensionality of the feature vectors increases. We show how to efficiently solve (4) in the fashion of first-order gradient descent.
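To make (5) and (6) concrete, the sketch below builds the matrix $\mathbf{A}^r$ for one triplet and evaluates the margin as the matrix inner product $\langle \mathbf{A}^r, \mathbf{X} \rangle$. The three points and the choice $\mathbf{X} = \mathbf{I}/2$ are made-up values for illustration, and the helper names are hypothetical.

```python
import numpy as np

def triplet_matrix(a_i, a_j, a_k):
    """A^r = (a_i - a_k)(a_i - a_k)^T - (a_i - a_j)(a_i - a_j)^T, as in (5)."""
    d_ik = a_i - a_k
    d_ij = a_i - a_j
    return np.outer(d_ik, d_ik) - np.outer(d_ij, d_ij)

def margin(A_r, X):
    """Matrix inner product <A^r, X> = Tr(A^r^T X)."""
    return float(np.sum(A_r * X))

# Toy triplet: a_j is the same-class neighbour, a_k the different-class one.
a_i = np.array([0.0, 0.0])
a_j = np.array([1.0, 0.0])
a_k = np.array([0.0, 3.0])
X = np.eye(2) / 2.0               # p.s.d. with Tr(X) = 1

A_r = triplet_matrix(a_i, a_j, a_k)
rho_r = margin(A_r, X)

# Same quantity computed directly from the two quadratic forms in (3).
direct = float((a_i - a_k) @ X @ (a_i - a_k) - (a_i - a_j) @ X @ (a_i - a_j))
```

Here the squared distances are 4.5 and 0.5, so the margin is 4.0; the point of the sketch is only that the margin is a linear function of $\mathbf{X}$.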

B. Employment of a differentiable loss function 

It is proved in [18] that a p.s.d. matrix can always be decomposed as a linear convex combination of a set of rank-one matrices. In the context of our problem, this means that $\mathbf{X} = \sum_i \theta_i \mathbf{Z}_i$, where $\mathbf{Z}_i$ is a rank-one matrix and $\mathrm{Tr}(\mathbf{Z}_i) = 1$. This important result inspires us to develop a gradient descent based optimization algorithm. In each iteration, $\mathbf{X}$ can be updated as

$$\mathbf{X}_{i+1} = \mathbf{X}_i + \alpha(\delta\mathbf{X} - \mathbf{X}_i) = \mathbf{X}_i + \alpha \mathbf{p}_i, \quad 0 \le \alpha \le 1, \tag{7}$$

where $\delta\mathbf{X}$ is a rank-one and trace-one matrix and $\mathbf{p}_i$ is the search direction. It is straightforward to verify that $\mathrm{Tr}(\mathbf{X}_{i+1}) = 1$ and $\mathbf{X}_{i+1} \succeq 0$ hold. This is the starting point of our gradient descent algorithm. With this update strategy, the trace-one property and positive semidefiniteness of $\mathbf{X}$ are always retained. We show how to calculate this search direction in Algorithm 2. Although it is possible to use subgradient methods to optimize non-smooth objective functions, we use a differentiable objective function instead so that the optimization procedure is simplified (standard gradient descent can be applied). So, we need to ensure that the objective function is differentiable with respect to the variables $\rho$ and $\mathbf{X}$.

Fig. 1. The hinge loss, squared hinge loss and Huber loss.
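The invariants claimed for update (7) can be checked numerically. The sketch below applies the update several times with arbitrary unit vectors and arbitrary step sizes in [0, 1]; the random choices stand in for the eigenvector and line-search results of the actual algorithm and are assumptions of this example only.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4                                   # illustrative dimension (assumed)

u = rng.standard_normal(D)
u /= np.linalg.norm(u)
X = np.outer(u, u)                      # rank-one, trace-one starting point

for _ in range(10):
    v = rng.standard_normal(D)
    v /= np.linalg.norm(v)
    delta_X = np.outer(v, v)            # rank-one, trace-one, p.s.d.
    alpha = rng.uniform(0.0, 1.0)       # any step size in [0, 1]
    X = X + alpha * (delta_X - X)       # update (7)

trace_X = float(np.trace(X))
min_eig = float(np.linalg.eigvalsh(X).min())
```

Because each new iterate is a convex combination of two p.s.d. trace-one matrices, the trace stays at one and the smallest eigenvalue stays non-negative, without any projection step.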

Let $f(\cdot)$ denote the objective function and $\lambda(\cdot)$ be a loss function. Our objective function can be rewritten as

$$f(\mathbf{X}, \rho) = \rho - C \cdot \sum_{r=1}^{|\mathcal{S}|} \lambda\big(\langle \mathbf{A}^r, \mathbf{X} \rangle - \rho\big). \tag{8}$$

The above problem (4) adopts the hinge loss function, which is defined as $\lambda(z) = \max(0, -z)$. However, the hinge loss is not differentiable at the point z = 0, and standard gradient-based optimization cannot be applied directly. In order to make standard gradient descent methods applicable, we propose to use differentiable loss functions, for example, the squared hinge loss or Huber loss functions as discussed below.

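The squared hinge loss function can be represented as

$$\lambda\big(\langle \mathbf{A}^r, \mathbf{X} \rangle - \rho\big) =
\begin{cases}
0, & \text{if } \langle \mathbf{A}^r, \mathbf{X} \rangle - \rho \ge 0, \\
\big(\langle \mathbf{A}^r, \mathbf{X} \rangle - \rho\big)^2, & \text{if } \langle \mathbf{A}^r, \mathbf{X} \rangle - \rho < 0.
\end{cases} \tag{9}$$

As shown in Fig. 1, this function connects the positive and zero segments smoothly and it is differentiable everywhere, including the point z = 0. We also consider the Huber loss function in this work:

$$\lambda\big(\langle \mathbf{A}^r, \mathbf{X} \rangle - \rho\big) =
\begin{cases}
0, & \text{if } \langle \mathbf{A}^r, \mathbf{X} \rangle - \rho \ge h, \\
\dfrac{\big(h - (\langle \mathbf{A}^r, \mathbf{X} \rangle - \rho)\big)^2}{4h}, & \text{if } -h < \langle \mathbf{A}^r, \mathbf{X} \rangle - \rho < h, \\
-\big(\langle \mathbf{A}^r, \mathbf{X} \rangle - \rho\big), & \text{if } \langle \mathbf{A}^r, \mathbf{X} \rangle - \rho \le -h,
\end{cases} \tag{10}$$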



where h is a parameter whose value is usually between 0.01 and 0.5. A Huber loss function with h = 0.5 is plotted in Fig. 1. There are three different parts in the Huber loss function, and together they form a continuous and differentiable function. This loss function approaches the hinge loss curve when $h \to 0$. Although the Huber loss is more complicated than the squared hinge loss, its function value increases only linearly with the value of $\langle \mathbf{A}^r, \mathbf{X} \rangle - \rho$. Hence, when a training set contains outliers or samples heavily contaminated by noise, the Huber loss might give a more reasonable (milder) penalty than the squared hinge loss does. We discuss both loss functions in our experimental study. Again, we highlight that by using these two loss functions, the cost function $f(\mathbf{X}, \rho)$ that we are going to optimize becomes differentiable with respect to both $\mathbf{X}$ and $\rho$.

Algorithm 1 The proposed optimization algorithm.
Input:
  • The maximum number of iterations K;
  • A pre-set tolerance value $\varepsilon$ (e.g., 10~^).
1 Initialize: $\mathbf{X}_0$ such that $\mathrm{Tr}(\mathbf{X}_0) = 1$, $\mathrm{rank}(\mathbf{X}_0) = 1$;
2 for $k = 1, 2, \cdots, K$ do
3     Compute $\rho_k$ by solving the subproblem $\rho_k = \arg\max_{\rho > 0} f(\mathbf{X}_{k-1}, \rho)$;
4     Compute $\mathbf{X}_k$ by solving the problem $\mathbf{X}_k = \arg\max_{\mathbf{X} \succeq 0,\, \mathrm{Tr}(\mathbf{X}) = 1} f(\mathbf{X}, \rho_k)$;
5     if $k > 1$ and $|f(\mathbf{X}_k, \rho_k) - f(\mathbf{X}_{k-1}, \rho_k)| < \varepsilon$ and $|f(\mathbf{X}_{k-1}, \rho_k) - f(\mathbf{X}_{k-1}, \rho_{k-1})| < \varepsilon$ then
6         break (converged);
Output: The final p.s.d. matrix $\mathbf{X}_k$.

Algorithm 2 Compute $\mathbf{X}_k$ in the proposed algorithm.
Input:
  • $\rho_k$ and $\mathbf{X}_i$, which is an initial approximation of $\mathbf{X}_k$;
  • The maximum number of iterations J.
1 for $i = 1, 2, \cdots, J$ do
2     Compute $\mathbf{v}_i$ that corresponds to the largest eigenvalue $l_i$ of the matrix $\nabla f(\mathbf{X}_i, \rho_k)$;
3     if $l_i < \varepsilon$ then
4         break (converged);
5     Let the search direction be $\mathbf{p}_i = \mathbf{v}_i \mathbf{v}_i^\top - \mathbf{X}_i$;
6     Set $\mathbf{X}_{i+1} = \mathbf{X}_i + \alpha \mathbf{p}_i$. Here $\alpha$ is found by line search;
Output: Set $\mathbf{X}_k = \mathbf{X}_i$.
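The three losses discussed in this section can be written down directly from their definitions. The sketch below is a plain NumPy transcription of the hinge loss, the squared hinge loss (9) and the Huber loss (10), together with the derivative of the Huber loss with respect to its argument z; the function names and the default h = 0.5 (the value plotted in Fig. 1) are choices made for this example.

```python
import numpy as np

def hinge(z):
    """Hinge loss max(0, -z); not differentiable at z = 0."""
    return np.maximum(0.0, -np.asarray(z, dtype=float))

def squared_hinge(z):
    """Squared hinge loss (9): 0 for z >= 0, z^2 for z < 0."""
    z = np.asarray(z, dtype=float)
    return np.where(z >= 0.0, 0.0, z ** 2)

def huber(z, h=0.5):
    """Huber loss (10): 0 for z >= h, quadratic on (-h, h), linear for z <= -h."""
    z = np.asarray(z, dtype=float)
    quadratic = (h - z) ** 2 / (4.0 * h)
    return np.where(z >= h, 0.0, np.where(z > -h, quadratic, -z))

def huber_grad(z, h=0.5):
    """Derivative of the Huber loss with respect to z."""
    z = np.asarray(z, dtype=float)
    return np.where(z >= h, 0.0, np.where(z > -h, -(h - z) / (2.0 * h), -1.0))
```

Evaluating `huber` and `huber_grad` just left and right of z = h and z = -h shows that both the value and the slope match at the joins, which is the differentiability property the text relies on.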

III. A Scalable and Fast Optimization Algorithm

The proposed algorithm maximizes the objective function iteratively, and in each iteration the two variables $\mathbf{X}$ and $\rho$ are optimized alternately. Note that this alternating strategy retains the global optimum because maximizing $f(\mathbf{X}, \rho)$ is a convex problem in both variables $(\mathbf{X}, \rho)$, and $\mathbf{X}$ and $\rho$ are not coupled together. We summarize the proposed algorithm in Algorithm 1. Note that $\rho_k$ is a scalar, and Line 3 in Algorithm 1 can be solved directly by a simple one-dimensional maximization process. However, $\mathbf{X}$ is a p.s.d. matrix of size D x D. Recall that D is the dimensionality of the feature vectors. The following section presents how $\mathbf{X}$ is efficiently optimized in our algorithm.
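The one-dimensional maximization over $\rho$ in Line 3 of Algorithm 1 is not spelled out in the text. One possible way to carry it out, shown here purely as a sketch for the squared hinge loss, is bisection on the derivative of the objective with respect to $\rho$, which is non-increasing in $\rho$. The function name, the bracketing interval and the treatment of the boundary at zero are assumptions of this example.

```python
import numpy as np

def rho_update_squared_hinge(margins, C, iters=60):
    """Maximize rho - C * sum_r squared_hinge(m_r - rho) over rho >= 0.

    `margins` holds the values m_r = <A^r, X> for the current X.
    """
    margins = np.asarray(margins, dtype=float)

    def d_objective(rho):
        # d/d rho of [rho - C * sum_r min(m_r - rho, 0)^2]
        z = margins - rho
        return 1.0 + 2.0 * C * np.sum(np.minimum(z, 0.0))

    lo = 0.0
    # At this upper end the derivative is guaranteed to be <= 0.
    hi = max(float(margins.max()), 0.0) + 1.0 / (2.0 * C)
    if d_objective(lo) <= 0.0:
        return lo                      # maximizer sits on the boundary
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if d_objective(mid) > 0.0:
            lo = mid                   # objective still increasing
        else:
            hi = mid
    return 0.5 * (lo + hi)

rho_star = rho_update_squared_hinge([1.0, 2.0], C=1.0)
```

For the toy margins 1 and 2 with C = 1, the derivative on the interval (1, 2) is 1 + 2(1 - ρ), so the maximizer is ρ = 1.5.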

A. Optimizing for the Mahalanobis matrix X

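Let $\mathcal{P} = \{\mathbf{X} \in \mathbb{R}^{D \times D} : \mathbf{X} \succeq 0, \mathrm{Tr}(\mathbf{X}) = 1\}$ be the domain in which a feasible $\mathbf{X}$ lies. Note that $\mathcal{P}$ is a convex set of $\mathbf{X}$. As shown in Line 4 in Algorithm 1, we need to solve the following maximization problem:

$$\max_{\mathbf{X} \in \mathcal{P}} f(\mathbf{X}, \rho_k), \tag{11}$$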



where $\rho_k$ is the output of Line 3 in Algorithm 1. Our algorithm offers a simple and efficient way of solving this problem by explicitly maintaining the positive semidefiniteness of the matrix $\mathbf{X}$. It only needs to compute the largest eigenvalue and the corresponding eigenvector, whereas most previous approaches, such as the method of [23], require a full eigen-decomposition of $\mathbf{X}$. Their computational complexities are $O(D^2)$ and $O(D^3)$, respectively. When D is large, this difference in computational complexity could be significant.

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 30, NO. 9, SEPTEMBER 200X

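Let $\nabla f(\mathbf{X}, \rho_k)$ be the gradient matrix of $f(\cdot)$ with respect to $\mathbf{X}$ and $\alpha$ be the step size for updating $\mathbf{X}$. Recall that we update $\mathbf{X}$ in such a way that $\mathbf{X}_{i+1} = (1 - \alpha)\mathbf{X}_i + \alpha\,\delta\mathbf{X}$, where $\mathrm{rank}(\delta\mathbf{X}) = 1$ and $\mathrm{Tr}(\delta\mathbf{X}) = 1$. To find the $\delta\mathbf{X}$ that satisfies these constraints and at the same time best approximates the gradient matrix $\nabla f(\mathbf{X}, \rho_k)$, we need to solve the following optimization problem:

$$\max_{\delta\mathbf{X}} \ \langle \nabla f(\mathbf{X}, \rho_k), \delta\mathbf{X} \rangle \quad \text{subject to } \mathrm{rank}(\delta\mathbf{X}) = 1, \ \mathrm{Tr}(\delta\mathbf{X}) = 1. \tag{12}$$

The optimal $\delta\mathbf{X}^\star$ is exactly $\mathbf{v}\mathbf{v}^\top$, where $\mathbf{v}$ is the eigenvector of $\nabla f(\mathbf{X}, \rho_k)$ that corresponds to the largest eigenvalue. The constraints say that $\delta\mathbf{X}$ is an outer product of a unit vector: $\delta\mathbf{X} = \mathbf{v}\mathbf{v}^\top$ with $\|\mathbf{v}\|_2 = 1$. Here $\|\cdot\|_2$ is the Euclidean norm. Problem (12) can then be written as: $\max_{\mathbf{v}} \mathbf{v}^\top [\nabla f(\mathbf{X}, \rho_k)] \mathbf{v}$, subject to $\|\mathbf{v}\|_2 = 1$. It is now clear that an eigen-decomposition gives the solution to the above problem.

Hence, to solve the above optimization, we only need to compute the leading eigenvector of the matrix $\nabla f(\mathbf{X}, \rho_k)$. Note that $\mathbf{X}$ still retains the properties $\mathbf{X} \succeq 0$, $\mathrm{Tr}(\mathbf{X}) = 1$ after applying this process.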
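The solution of (12) can be illustrated with a few lines of NumPy. For simplicity this sketch calls `numpy.linalg.eigh`, which computes a full eigen-decomposition, and keeps only the last eigenpair; the paper instead relies on a solver that returns just the leading eigenvector, so the call below is a stand-in chosen for brevity. The random symmetric matrix plays the role of the gradient and is an assumption of this example.

```python
import numpy as np

def best_rank_one_direction(G):
    """Return (l_max, delta_X) where delta_X = v v^T maximizes <G, delta_X>
    over rank-one, trace-one matrices v v^T with ||v||_2 = 1."""
    G_sym = 0.5 * (G + G.T)                 # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(G_sym)  # eigenvalues in ascending order
    v = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
    return float(eigvals[-1]), np.outer(v, v)

rng = np.random.default_rng(2)
B = rng.standard_normal((6, 6))
G = 0.5 * (B + B.T)                         # stand-in for the gradient matrix

l_max, delta_X = best_rank_one_direction(G)
objective = float(np.sum(G * delta_X))      # <G, delta_X> = v^T G v

# No random unit vector should achieve a larger value of v^T G v.
candidates = rng.standard_normal((200, 6))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
best_random = float(np.max(np.einsum('nd,de,ne->n', candidates, G, candidates)))
```

The objective value at the optimum equals the largest eigenvalue, which is the quantity tested against the tolerance in Algorithm 2.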

Clearly, a key parameter of this optimization process is $\alpha$, which implicitly decides the total number of iterations. The computational overhead of our algorithm is proportional to the number of iterations. Hence, to achieve a fast optimization process, we need to ensure that in each iteration $\alpha$ leads to a sufficient reduction in the value of $f$. This is discussed in the following part.

B. Finding the optimal step size $\alpha$

We employ the backtracking line search algorithm in [15] to identify a suitable $\alpha$. It reduces the value of $\alpha$ until the Wolfe conditions are satisfied. As shown in Algorithm 2, the search direction is $\mathbf{p}_i = \mathbf{v}_i \mathbf{v}_i^\top - \mathbf{X}_i$. The Wolfe conditions that we use are

$$\begin{aligned}
f(\mathbf{X}_i + \alpha \mathbf{p}_i, \rho_k) &\le f(\mathbf{X}_i, \rho_k) + c_1 \alpha\, \mathbf{p}_i^\top \nabla f(\mathbf{X}_i, \rho_k), \\
|\mathbf{p}_i^\top \nabla f(\mathbf{X}_i + \alpha \mathbf{p}_i, \rho_k)| &\le c_2\, |\mathbf{p}_i^\top \nabla f(\mathbf{X}_i, \rho_k)|,
\end{aligned} \tag{13}$$

where $0 < c_1 < c_2 < 1$. The result of the backtracking line search is an acceptable $\alpha$ that gives rise to a sufficient reduction in the function value of $f(\cdot)$. We show in the experiments that with this setting our optimization algorithm can achieve higher computational efficiency than some of the existing solvers.
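A backtracking search of this kind can be sketched as follows. This example is deliberately simplified: it checks only the first (sufficient-change) condition, stated for a maximization problem, and omits the curvature condition in (13). The toy concave objective, the constants `c1` and `shrink`, and all names are assumptions made for illustration, not the paper's settings.

```python
import numpy as np

def backtracking_step(f, grad_f, X, p, c1=1e-4, shrink=0.5, max_halvings=50):
    """Shrink alpha from 1 until f(X + alpha p) >= f(X) + c1 * alpha * <p, grad f(X)>."""
    slope = float(np.sum(p * grad_f(X)))    # directional derivative along p
    f0 = f(X)
    alpha = 1.0
    for _ in range(max_halvings):
        if f(X + alpha * p) >= f0 + c1 * alpha * slope:
            return alpha
        alpha *= shrink
    return 0.0                              # no acceptable step was found

# Toy concave objective with a trace-one p.s.d. maximizer T (made-up values).
T = np.diag([0.6, 0.4, 0.0])
f = lambda X: -float(np.sum((X - T) ** 2))
grad_f = lambda X: -2.0 * (X - T)

X = np.diag([0.5, 0.5, 0.0])
eigvals, eigvecs = np.linalg.eigh(grad_f(X))
v = eigvecs[:, -1]                          # leading eigenvector of the gradient
p = np.outer(v, v) - X                      # search direction as in Algorithm 2

alpha = backtracking_step(f, grad_f, X, p)
X_new = X + alpha * p
```

In this toy case the full step and the half step both overshoot the maximizer, so the search settles on a step of 0.25; the new iterate keeps trace one because the direction itself has zero trace.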

IV. Experiments 

The goal of these experiments is to verify the efficiency of our algorithm in achieving comparable (or sometimes even better) classification performance with a reduced computational cost. We perform experiments on the 10 data sets described in Table I. For some data sets, PCA is performed to remove noise and reduce the dimensionality. The metric learning algorithms are then run on the data sets pre-processed by PCA. The Wine, Balance, Vehicle, Breast-Cancer and Diabetes data sets are obtained from the UCI Machine Learning Repository [14], and USPS, MNIST and Letter are from LibSVM [3]. For MNIST, we only use its test data in our experiment. The ORLface data is from AT&T Research^1 and Twin-Peaks is downloaded from L. van der Maaten's website^2. The Face and Background classes (435 and 520 images, respectively) in the image retrieval experiment are obtained from the Caltech-101 object database [6]. In order to perform statistical analysis, the ORLface, Twin-Peaks, Wine, Balance, Vehicle, Diabetes and Face-Background data sets are randomly split into 10 sets of train/validation/test subsets, and the experiments on those data sets are repeated on each split (10 runs).

^1 http://www.uk.research.att.com/facedatabase.html
^2 http://ticc.uvt.nl/lvdrmaaten/

The k-NN classifier with the Mahalanobis distance learned by our algorithm (termed SDPMetric in short) is compared with the k-NN classifiers using a simple Euclidean distance ("Euclidean" in short) and the distance learned by the Large Margin Nearest Neighbor method in [23] (LMNN^3 in short). Since Weinberger et al. [23] have shown that LMNN obtains classification performance comparable to support vector machines on some data sets, we focus on the comparison between our algorithm and LMNN, which is considered the state of the art. To prepare the training triplet set $\mathcal{S}$, we apply the 3-NN method to these data sets and generate the training triplets for our algorithm. The training data sets for LMNN are also generated using 3-NN, except that the 1-NN method is applied to Twin-Peaks and ORLface. Also, the experiment compares the two variants of our proposed SDPMetric, which use the squared hinge loss (denoted as SDPMetric-S) and the Huber loss (SDPMetric-H), respectively. We randomly split each data set into 70/15/15% and refer to those splits as the training, cross-validation and test sets, except for the pre-separated data sets (Letter and USPS) and Face-Background, which was made for image retrieval. Following [23], LMNN uses 85/15% of the data for training and testing. The training data is also split into 70/15% in LMNN for cross validation, to be consistent with our SDPMetric. Since the USPS data set has already been split into training/test sets, only the training data are divided into 70/15% for training and validation. The Letter data set is separated according to Hsu and Lin [9]. As in [23], PCA is applied to USPS, MNIST and ORLface to reduce the dimensionality of feature vectors.
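The text does not detail how the 3-NN method turns a labeled data set into triplets. One plausible reading, sketched below as an assumption rather than the authors' procedure, pairs each sample with its 3 nearest same-class neighbors and its 3 nearest different-class neighbors under the Euclidean distance, giving 9 triplets per sample. This count is consistent with several rows of Table I (for example, 126 x 9 = 1,134 for Wine). The function name and the toy data are hypothetical.

```python
import numpy as np

def build_triplets(A, labels, k=3):
    """Return index triplets (i, j, l): j among the k nearest same-class
    neighbours of i, l among the k nearest different-class neighbours of i."""
    n = A.shape[0]
    # Pairwise squared Euclidean distances (fine for small n).
    sq = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=2)
    idx = np.arange(n)
    triplets = []
    for i in range(n):
        same = idx[(labels == labels[i]) & (idx != i)]
        other = idx[labels != labels[i]]
        targets = same[np.argsort(sq[i, same])[:k]]
        impostors = other[np.argsort(sq[i, other])[:k]]
        for j in targets:
            for l in impostors:
                triplets.append((i, int(j), int(l)))
    return triplets

# Toy data: two well-separated classes with four points each.
A = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [5, 5], [5, 6], [6, 5], [6, 6]], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
triplets = build_triplets(A, labels, k=3)
```

Each resulting triplet (i, j, l) supplies one matrix of the form (5), with the sample indexed by l playing the role of the different-class point.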

The following experimental study demonstrates that our algorithm achieves slightly better classification accuracy at a much lower computational cost than LMNN on most of the tested data sets. The detailed test error rates and timing results are reported in Tables II and III. As we can see, the test error rates of SDPMetric-S are comparable to those of LMNN. SDPMetric-H achieves lower misclassification error rates than LMNN and the Euclidean distance on most data sets, except the Face-Background data (which is treated as an image retrieval problem) and MNIST, on which SDPMetric-S achieves a lower error rate. Overall, we can conclude that the proposed SDPMetric, with either the squared hinge loss or the Huber loss, is at least comparable to (or sometimes slightly better than) the state-of-the-art LMNN method in terms of classification performance.

Before reporting the timing results on these benchmark data sets, we compared our algorithm (SDPMetric-H) with two convex optimization solvers, namely SeDuMi [20] and SDPT3 [21], which are used as internal solvers in the disciplined convex programming software CVX [8]. Both SeDuMi and SDPT3 use interior-point based methods. To perform eigen-decomposition, our SDPMetric uses ARPACK [19], which is designed to solve large-scale eigenvalue problems. Our SDPMetric is implemented in standard C/C++. Experiments have been conducted on a standard desktop. We randomly generated 1,000 training triplets and gradually increased the dimensionality of the feature vectors from 20 to 100. Fig. 2 illustrates the computational time of ours, CVX/SeDuMi and CVX/SDPT3. As shown, the computational load of our algorithm stays almost constant as the dimensionality increases. This might be because the CPU time of eigen-decomposition does not dominate in SDPMetric for dimensions varying from 20 to 100 on this data set. In contrast, the computational loads of CVX/SeDuMi and CVX/SDPT3 increase quickly over this range. At dimension 100, the difference in CPU time can be as large as 800 to 1,000 seconds. This shows the inefficiency and poor scalability of standard interior-point methods. Secondly, the computational times of LMNN, SDPMetric-S and SDPMetric-H on these benchmark data sets are compared in Table III. As shown, LMNN is always slower than the proposed SDPMetric, which converges very fast on these data sets. In particular, on the Letter and Twin-Peaks data sets, SDPMetric shows significantly improved computational efficiency.

^3 In our experiment, we have used the implementation of LMNN's authors. Note that, to be consistent with the setting in [23], LMNN here also uses the "obj=1" option and updates the projection matrix to speed up its computation. If we updated the distance matrix directly to obtain the global optimum, LMNN would be much slower due to the full eigen-decomposition at each iteration.

TABLE I
The ten benchmark data sets used in the experiment. A dash in "dimension after PCA" indicates no PCA processing.

Data set          # training  # validation  # test  dimension  dimension after PCA  # classes  # runs  # triplets for SDPMetric
USPS_PCA               5,833         1,458   2,007        256                   60         10       1                    52,497
USPS                   5,833         1,458   2,007        256                    -         10       1                     5,833
MNIST_PCA              7,000         1,500   1,500        784                   60         10       1                    54,000
MNIST                  7,000         1,500   1,500        784                    -         10       1                     7,000
Letter                10,500         4,500   5,000         16                    -         26       1                    94,500
ORLface                  280            60      60      2,576                   42         40      10                       280
Twin-Peaks            14,000         3,000   3,000          3                    -         11      10                    14,000
Wine                     126            26      26         13                    -          3      10                     1,134
Balance                  439            93      93          4                    -          3      10                     3,951
Vehicle                  593           127     126         18                    -          4      10                     5,337
Breast-Cancer            479           102     102         10                    -          2      10                     4,311
Diabetes                 538           115     115          8                    -          2      10                     4,842
Face-Background          472           101     382        100                    -          2      10                     4,428

The Face-Background data set consists of two object classes, Face-easy and Background-Google in [6], and is treated as a retrieval problem. The images in the Background-Google class are randomly collected from the Internet and are used to represent the non-target class. For each image, a number of interest regions are identified by the Harris-Affine detector [13], and the visual content in each region is characterized by the SIFT descriptor [12]. A codebook of size 100 is created by using k-means clustering. Each image is then represented by a 100-dimensional histogram vector containing the number of occurrences of each visual word. We evaluate retrieval accuracy using each facial image in a test subset as a query. For each compared metric, the accuracy of the top 1 to 20 retrieved images is computed, which is defined as the ratio of the number of facial images to the total number of retrieved images. We calculate the average accuracy on each test subset and then average over all 10 test subsets. Fig. 3 shows the retrieval accuracies of the Euclidean distance and of the Mahalanobis distances learned by LMNN and SDPMetric. Clearly, SDPMetric-H and SDPMetric-S consistently present higher retrieval accuracy values, which again verifies their advantages over the LMNN method and the Euclidean distance.
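The retrieval accuracy defined above (the fraction of facial images among the top retrieved images) can be computed as in the following sketch. The tiny gallery, the query and the matrix $\mathbf{X}$ are made-up values, and the function name is hypothetical; the paper's own evaluation code is not shown in the text.

```python
import numpy as np

def retrieval_accuracy(X, query, gallery, gallery_is_face, k):
    """Fraction of face images among the k gallery items closest to `query`
    under the squared Mahalanobis distance defined by X."""
    diffs = gallery - query
    dists = np.einsum('nd,de,ne->n', diffs, X, diffs)   # (g - q)^T X (g - q)
    top = np.argsort(dists)[:k]
    return float(np.mean(gallery_is_face[top]))

# Toy gallery of four 2-D "images"; the first two are faces.
gallery = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 5.0], [3.0, 3.0]])
gallery_is_face = np.array([True, True, False, False])
query = np.array([0.1, 0.0])
X = np.eye(2) / 2.0                     # trace-one stand-in for a learned metric

acc_top2 = retrieval_accuracy(X, query, gallery, gallery_is_face, k=2)
acc_top3 = retrieval_accuracy(X, query, gallery, gallery_is_face, k=3)
```

Averaging this quantity over all face queries in a test subset, and then over the test subsets, would give one point of a curve such as those in Fig. 3 for each value of k.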
Fig. 2. Computational time versus the dimensionality of feature vectors.

Fig. 3. Retrieval performances of SDPMetric-S, SDPMetric-H, LMNN and the Euclidean distance. The curves of SDPMetric-S and SDPMetric-H are very close.



V. Conclusion

We have proposed a new algorithm to demonstrate how to efficiently learn a Mahalanobis distance metric with the principle of margin maximization. Inspired by the important theorem on p.s.d. matrix decomposition in [18], we have designed a gradient descent method that updates the Mahalanobis matrix at a low computational cost while maintaining the p.s.d. property of the learned matrix during the whole optimization process. Experiments on benchmark data sets and the retrieval problem verify the superior classification performance and computational efficiency of the proposed distance metric learning algorithm.

The proposed algorithm may be used to solve more general SDP problems in machine learning. Looking for other applications is one of the future research directions.